CQC and NHS both maintain administrative datasets regarding a range of health and social care providers, but the identifier schemes are independent to each of them.
This notebook explores a rough and ready way of rying to reconcile them.
NHS administrative data retrieved from NHS Digital and added to a local database using the recipe described here: NHS and GP Administrative Data.ipynb
Note: for the purposes of this notebook, we could equally just download the NHS adminstrative data file epraccurr
from NHS Digital: GP and GP practice related data.
import pandas as pd
#!mkdir -p data
#!wget http://www.cqc.org.uk/sites/default/files/HSCA%20Active%20Locations.xlsx -P data
Download the CQC Active Locations bulk data file:
cqc_df=pd.read_excel('data/HSCA Active Locations.xlsx',
skiprows=6,
parse_dates=['HSCA start date','Provider HSCA start date'])
cqc_df.head(3)
Location ID | HSCA start date | Care home? | Location Name | Telephone Number | Registered manager (note; where there is more than one manager at a location, only one is included here for ease of presentation. The full list is available if required). | Web Address | Care homes beds | Location Type/Sector | Location Primary Inspection Category | ... | Service user band - Learning disabilities or autistic spectrum disorder | Service user band - Mental Health | Service user band - Older People | Service user band - People detained under the Mental Health Act | Service user band - People who misuse drugs and alcohol | Service user band - People with an eating disorder | Service user band - Physical Disability | Service user band - Sensory Impairment | Service user band - Whole Population | Service user band - Younger Adults | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1-1000210669 | 2013-12-12 | Y | Kingswood House Nursing Home | 01424716303 | Turner, Patricia Anne | NaN | 22.0 | Social Care Org | Residential social care | ... | NaN | Y | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Y |
1 | 1-1000270393 | 2013-10-16 | N | Red Kite Home Care | NaN | Hall, Pearl | NaN | 0.0 | Social Care Org | Community based adult social care services | ... | NaN | Y | Y | NaN | NaN | NaN | Y | NaN | NaN | Y |
2 | 1-1000312641 | 2013-10-18 | N | Human Support Group Limited - Sale | 01619429490 | * | www.homecaresupport.co.uk | 0.0 | Social Care Org | Community based adult social care services | ... | NaN | Y | Y | NaN | Y | Y | Y | Y | NaN | Y |
3 rows × 94 columns
#Identify the name of the postcode column
[c for c in cqc_df.columns if 'post' in c.lower()]
['Postal Code', 'Provider - Postal Code']
Access the NHS GP practice list:
import sqlite3
con = sqlite3.connect("nhsadmin.sqlite")
EPRACCUR='epraccur'
nhs_df=pd.read_sql_query('SELECT * FROM {typ}'.format(typ=EPRACCUR), con)
nhs_df.head()
Organisation Code | Name | National Grouping | High Level Health Geography | Address Line 1 | Address Line 2 | Address Line 3 | Address Line 4 | Address Line 5 | Postcode | ... | Close Date | Status Code | Organisation Sub-Type code | Commissioner | Join Provider/Purchaser Date | Left Provider/Purchaser Date | Contact Telephone Number | Amended Record Indicator | Provider/Purchaser | Prescribing Setting | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A81001 | THE DENSHAM SURGERY | Y54 | Q74 | THE HEALTH CENTRE | LAWSON STREET | STOCKTON-ON-TEES | CLEVELAND | None | TS18 1HU | ... | None | A | B | 00K | 2013-04-01 00:00:00 | None | 01642 672351 | 0 | 00K | 4 |
1 | A81002 | QUEENS PARK MEDICAL CENTRE | Y54 | Q74 | QUEENS PARK MEDICAL CTR | FARRER STREET | STOCKTON ON TEES | CLEVELAND | None | TS18 2AW | ... | None | A | B | 00K | 2013-04-01 00:00:00 | None | 01642 679681 | 0 | 00K | 4 |
2 | A81003 | VICTORIA MEDICAL PRACTICE | Y54 | Q74 | THE HEALTH CENTRE | VICTORIA ROAD | HARTLEPOOL | CLEVELAND | None | TS26 8DB | ... | None | D | B | 00K | 2013-04-01 00:00:00 | None | 01429 272945 | 0 | 00K | 4 |
3 | A81004 | WOODLANDS ROAD SURGERY | Y54 | Q74 | 6 WOODLANDS ROAD | None | MIDDLESBROUGH | CLEVELAND | None | TS1 3BE | ... | None | A | B | 00M | 2013-04-01 00:00:00 | None | 01642 247982 | 0 | 00M | 4 |
4 | A81005 | SPRINGWOOD SURGERY | Y54 | Q74 | SPRINGWOOD SURGERY | RECTORY LANE | GUISBOROUGH | None | None | TS14 7DJ | ... | None | A | B | 00M | 2013-04-01 00:00:00 | None | 01287 619611 | 0 | 00M | 4 |
5 rows × 21 columns
Use the postcode as a crib and then also match on exact name, albeit case insensitive.
epraccurr status code - A
is active.
cqc_df['Location Name']=cqc_df['Location Name'].str.upper()
merge1=pd.merge(nhs_df[nhs_df['Status Code']=='A'][['Organisation Code','Name','Postcode']],
cqc_df[['Location ID','Location Name','Postal Code']],
left_on=['Postcode','Name'], right_on=['Postal Code','Location Name'],indicator=True,how='left')
m1=merge1
m1.head()
Organisation Code | Name | Postcode | Location ID | Location Name | Postal Code | _merge | |
---|---|---|---|---|---|---|---|
0 | A81001 | THE DENSHAM SURGERY | TS18 1HU | 1-540731286 | THE DENSHAM SURGERY | TS18 1HU | both |
1 | A81002 | QUEENS PARK MEDICAL CENTRE | TS18 2AW | 1-540740218 | QUEENS PARK MEDICAL CENTRE | TS18 2AW | both |
2 | A81004 | WOODLANDS ROAD SURGERY | TS1 3BE | 1-549119442 | WOODLANDS ROAD SURGERY | TS1 3BE | both |
3 | A81005 | SPRINGWOOD SURGERY | TS14 7DJ | 1-540766364 | SPRINGWOOD SURGERY | TS14 7DJ | both |
4 | A81006 | TENNANT STREET MEDICAL PRACTICE | TS18 2AT | 1-540785056 | TENNANT STREET MEDICAL PRACTICE | TS18 2AT | both |
The left join
means we retain all the current practices from the NHS Digital listing.
We can filter out the rows that didn't match by searching for records with a left_only
desginator in the merge
column of the merged dataframe.
print('Number of unmatched rows: {}'.format(len(m1[m1['_merge']=='left_only'])))
m1[m1['_merge']=='left_only'].head()
Number of unmatched rows: 7408
Organisation Code | Name | Postcode | Location ID | Location Name | Postal Code | _merge | |
---|---|---|---|---|---|---|---|
5 | A81007 | BANKHOUSE SURGERY | TS24 7PW | NaN | NaN | NaN | left_only |
21 | A81026 | THE LINTHORPE SURGERY | TS5 6HA | NaN | NaN | NaN | left_only |
24 | A81030 | BOROUGH ROAD & NUNTHORPE MEDICAL GROUP | TS1 3RY | NaN | NaN | NaN | left_only |
26 | A81032 | THE GARTH | TS14 7DJ | NaN | NaN | NaN | left_only |
28 | A81034 | THORNABY & BARWICK MEDICAL GROUP | TS17 0EE | NaN | NaN | NaN | left_only |
Use the postcode as a crib and then try to fuzzy match on name, again case insensitive.
Let's see if we can do some partial/fuzzy matching with a relatively high degree of confidence:
from fuzzywuzzy import fuzz
m2=pd.merge(m1[m1['_merge']=='left_only'][['Organisation Code','Name','Postcode']],
cqc_df[['Location ID','Location Name','Postal Code']],
left_on=['Postcode'], right_on=['Postal Code'],indicator=True,how='left')
m2['fuzz']=m2.dropna().apply(lambda x: fuzz.ratio(x['Name'],x['Location Name']),axis=1)
m2['fuzz2']=m2.dropna().apply(lambda x: fuzz.partial_ratio(x['Name'],x['Location Name']),axis=1)
def confdisp(minconf=85,maxconf=100,typ='fuzz',lim=10):
return m2[(m2[typ]>=minconf) & (m2[typ]<=maxconf)].head(lim)
print(len(confdisp(85)))
confdisp(85).head()
10
Organisation Code | Name | Postcode | Location ID | Location Name | Postal Code | _merge | fuzz | fuzz2 | |
---|---|---|---|---|---|---|---|---|---|
5 | A81026 | THE LINTHORPE SURGERY | TS5 6HA | 1-2251705571 | LINTHORPE SURGERY | TS5 6HA | both | 89.0 | 100.0 |
7 | A81030 | BOROUGH ROAD & NUNTHORPE MEDICAL GROUP | TS1 3RY | 1-557690214 | BOROUGH ROAD AND NUNTHORPE MEDICAL GROUP | TS1 3RY | both | 95.0 | 92.0 |
12 | A81034 | THORNABY & BARWICK MEDICAL GROUP | TS17 0EE | 1-544613260 | THORNABY AND BARWICK MEDICAL GROUP | TS17 0EE | both | 94.0 | 91.0 |
14 | A81037 | THE ERIMUS PRACTICE | TS1 2NX | 1-2265850518 | ERIMUS PRACTICE | TS1 2NX | both | 88.0 | 100.0 |
24 | A81039 | EAGLESCLIFFE MEDICAL PRACTICE | TS16 9EA | 1-542237827 | THE EAGLESCLIFFE MEDICAL PRACTICE | TS16 9EA | both | 94.0 | 100.0 |
We can make an interactive explorer using ipywidgets
to explore different levels of confidence:
from ipywidgets import interact
import ipywidgets
interact(confdisp,
minconf=ipywidgets.IntSlider(min=0,max=100,step=1,value=85),
maxconf=ipywidgets.IntSlider(min=0,max=100,step=1,value=100),
typ=ipywidgets.RadioButtons(options=['fuzz', 'fuzz2']));
We can also explore other fuzzy match combinations - note how we might be able to improve matters if we remove subject specific stop words e.g. Suregry, Medical Centre etc.
m2[(m2['fuzz']<=85) & (m2['fuzz2']>=85)].head()
Organisation Code | Name | Postcode | Location ID | Location Name | Postal Code | _merge | fuzz | fuzz2 | |
---|---|---|---|---|---|---|---|---|---|
9 | A81032 | THE GARTH | TS14 7DJ | 1-571280521 | THE GARTH SURGERY | TS14 7DJ | both | 69.0 | 100.0 |
21 | A81038 | HIRSEL MEDICAL CENTRE | TS3 6AL | 1-566800129 | KINGS MEDICAL CENTRE | TS3 6AL | both | 83.0 | 85.0 |
27 | A81040 | MARSH HOUSE MEDICAL PRACTICE | TS23 2DG | 1-543965910 | MARSH HOUSE MEDICAL CENTRE | TS23 2DG | both | 85.0 | 85.0 |
28 | A81044 | MCKENZIE HOUSE SURGERY | TS25 1QU | 1-549723762 | MCKENZIE HOUSE | TS25 1QU | both | 78.0 | 100.0 |
61 | A82021 | CASTLEGATE AND DERWENT SURGERY | CA13 9HT | 1-1522911257 | DERWENT SURGERY | CA13 9HT | both | 67.0 | 100.0 |